Abstract

The benefit of sexual recombination is one of the most fundamental questions both in population
genetics and evolutionary computation. It is widely believed that recombination helps solve difficult
optimization problems. We present the first result which rigorously proves that it is beneficial to use
sexual recombination in an uncertain environment with a noisy fitness function. For this, we model
sexual recombination with a simple estimation of distribution algorithm called the Compact Genetic
Algorithm (cGA), which we compare with the classical $(\mu+1)$-EA. For a simple noisy fitness function with
additive Gaussian posterior noise $\mathcal{N}(0, \sigma^2)$, we prove that the mutation-only $(\mu+1)$-EA typically cannot
handle noise in polynomial time for $\sigma^2$ large enough, while the cGA runs in polynomial time as long as
the population size is not too small. This shows that in this uncertain environment sexual recombination
is provably beneficial. We observe the same behavior in a small empirical study.


1 Introduction 

Heuristic optimization is widely used in artificial intelligence for solving hard optimization problems, for 
which no efficient problem-specific algorithm is known. Such problems are typically very large, noisy and 
heavily constrained and cannot be solved by simple textbook algorithms. The inspiration for heuristic 
general-purpose problem solvers often comes from nature. A well-known example is simulated annealing, 
which is inspired by physical annealing in metallurgy. The largest and probably most successful class,
however, is that of biologically-inspired algorithms, especially evolutionary algorithms.

Evolutionary and genetic algorithms. Evolutionary Algorithms (EAs) were introduced in the 1960s and 
have been successfully applied to a wide range of complex engineering and combinatorial problems nnnnn]. 
Like Darwinian evolution in nature, evolutionary algorithms construct new solutions from old ones and select 
the fitter ones to continue to the next iteration. The construction of new solutions from old ones, so-called 
reproduction, can be asexual (mutation of a single individual) or sexual (crossover of two individuals). An
EA which uses sexual reproduction is typically called a Genetic Algorithm (GA). Since the beginning of EAs, it
has been argued that GAs should be more powerful than pure EAs which use only asexual reproduction [12].
This was debated for decades, but theoretical results and explanations on crossover are still scarce. There are 
some results for simple artificial test functions, where it was proven that a GA asymptotically outperforms 
an EA without crossover [inilTll [331 [Ml [231119] and the other way around [27]. However, these artificial test 
functions are typically tailored to the specific algorithm and proof technique and the results give little insight 
on the advantage of sexual reproduction on realistic problems occurring in artificial intelligence. There are 
also a few theoretical results for problem-specific algorithms and representations, namely coloring problems 
inspired by the Ising model [50] and the all-pairs shortest path problem [5]. For a nice overview of different 
aspects where populations and sex are beneficial for optimization of static fitness functions, see [26].

Noisy search. Heuristic optimization methods are typically not used for simple problems, but for rather 
difficult problems in uncertain environments. Evolutionary algorithms are very popular in settings including 
uncertainties; see [3] for a survey on examples in combinatorial optimization, but also m for an excellent
survey also discussing different sources of uncertainty. Uncertainty can be modeled by a probabilistic fitness 
function, that is, a search point can have different fitness values each time it is evaluated. One way to deal 
with this is to replace fitness evaluations with an average of a (large) sample of fitness evaluations and then 
proceed as if there was no noise. We take a different route, accept the noise, and try to analyze how much 
noise can be overcome by EAs without further modifications. To do this in a rigorous manner, we assume 
additive posterior noise, that is, each time the fitness value of a search point is evaluated, we add a noise 
value drawn from some distribution. This model was studied in evolutionary algorithms without crossover 

in [H [Ml El HI]- 

Our results on graceful scaling. It has been observed that evolutionary algorithms benefit from sexual
recombination on simple static problems. It has also been observed that evolutionary algorithms (EAs) work
in uncertain environments. The important question of whether and how sexual recombination helps EAs on
noisy problems has remained open so far. We introduce the concept of graceful scaling (Def. 1) to measure how
well a black-box optimization algorithm can handle noise. We first prove a sufficient condition for when a
noise model is intractable for optimization by the classical $(\mu+1)$-EA (Theorem 5) and show that this
implies that this simple asexual algorithm does not scale gracefully for large Gaussian noise (Corollary 6).
On the other hand, we study the compact GA (cGA), which strongly relies on recombination, and prove that
this sexual algorithm can handle noise gracefully (Theorem 11). These asymptotic results are complemented
and matched by corresponding experiments (Section 4). We observe empirically that the noise-oblivious
variant of the cGA, which has no knowledge of properties of the added noise, performs especially
well. This confirms our theoretical finding that sexual recombination is especially powerful in uncertain
environments.

Biological motivation. Another motivation for our work comes from a biological perspective. The exact
analysis of sexual recombination in both natural biological populations and in evolutionary computation
is extremely difficult. In the field of population genetics, researchers often study the effects of recombination
by describing the dynamics of natural selection on a freely recombining population under linkage equilibrium
in terms of the change in allele frequencies. Recently, several researchers have noticed a connection between
these models and optimization algorithms such as EDAs [20] from the evolutionary computation community
and the Multiplicative Weights Update Algorithm (MWUA) [1] also known from statistical machine
learning EE]. The cGA is an EDA that tracks allele frequencies by simulating a population of $K$ individuals
undergoing gene pool recombination [21], where offspring are produced essentially by performing crossover
with all $K$ individuals as parents. In this way, the cGA is reasonably similar to models used in population
genetics for studying sexually recombining populations, and thus we hope that our results can illuminate some
of the utility of crossover in the presence of noisy signals for adaptation.

2 Preliminaries 

Let $F$ be a family of pseudo-Boolean functions $(F_n)_{n \in \mathbb{N}}$ where each $F_n$ is a set of functions $f\colon \{0,1\}^n \to \mathbb{R}$.
Let $D$ be a family of distributions $(D_v)_{v \in \mathbb{R}}$ such that for all $D_v \in D$, $E(D_v) = 0$. We define $F$ with additive
posterior $D$-noise as the set $F[D] := \{f_n + D_v : f_n \in F_n, D_v \in D\}$.

Definition 1. An algorithm $A$ scales gracefully with noise on $F[D]$ if there is a polynomial $q$ such that, for
all $g_{n,v} = f_n + D_v \in F[D]$, there exists a parameter setting $p$ such that $A(p)$ finds the optimum of $f_n$ using
at most $q(n, v)$ calls to $g_{n,v}$.

In the remainder of the paper, we will study a particular function class (OneMax) and a particular noise
distribution (Gaussian, parametrized by the variance). Let $\sigma^2 > 0$. We define the noisy OneMax function
$\mathrm{OM}[\sigma^2]\colon \{0,1\}^n \to \mathbb{R} := x \mapsto \|x\|_1 + Z$ where $\|x\|_1 := |\{i : x_i = 1\}|$ and $Z$ is a normally distributed random
variable $Z \sim \mathcal{N}(0, \sigma^2)$ with zero mean and variance $\sigma^2$.
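The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `noisy_onemax` and the `rng` parameter are assumptions made for this example, and the noise is redrawn on every call, as in the posterior noise model.

```python
import random

def noisy_onemax(x, sigma2, rng=random):
    """Return ||x||_1 + Z with Z ~ N(0, sigma2); a fresh Z is drawn on every call."""
    return sum(x) + rng.gauss(0.0, sigma2 ** 0.5)

# Averaging many noisy evaluations recovers the underlying OneMax value.
rng = random.Random(0)
x = [1, 0, 1, 1, 0]
samples = [noisy_onemax(x, 4.0, rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
```

Two evaluations of the same search point generally differ, which is why a single comparison between two points can be wrong.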

The following proposition gives tail bounds for $Z$ by using standard estimates of the complementary error
function [34].

Proposition 2. Let $Z$ be a zero-mean Gaussian random variable with variance $\sigma^2$. For all $t > 0$ we have

$$\Pr(Z < -t) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{t}{\sigma\sqrt{2}}\right)$$

and asymptotically for large $t > 0$,

$$\Pr(Z < -t) = (1 + o(1))\,\frac{\sigma}{\sqrt{2\pi}\,t}\,e^{-t^2/(2\sigma^2)}.$$


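Definition 3. Let $x, y \in \{0,1\}^n$. Without loss of generality, suppose $\|x\|_1 - \|y\|_1 = \ell \ge 0$. Since $\mathrm{OM}[\sigma^2]$ is
a function of unitation, the probability that it misclassifies $y$ as superior to $x$ depends only on the so-called
phenotypic distance $\ell$. We define $\Phi\colon [n] \cup \{0\} \to [0,1]$ as

$$\Phi(\ell) = \begin{cases} 1/2 & \text{if } \ell = 0, \\ \Pr(\mathcal{E} \mid \|x\|_1 - \|y\|_1 = \ell) & \text{if } \ell > 0; \end{cases}$$

where $\mathcal{E}$ is the event that $\mathrm{OM}[\sigma^2](x) < \mathrm{OM}[\sigma^2](y)$.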

Lemma 4. For any $0 < \ell < n$, $\Phi(\ell) > \Phi(\ell + 1)$. Moreover, assuming $\sigma^2 > 0$,

$(£)< 

Proof. Let $x$ and $y$ be chosen arbitrarily from the set of all pairs of length-$n$ binary strings with $\|x\|_1 - \|y\|_1 = \ell$
for any $\ell \in [n]$. The event that $\mathrm{OM}[\sigma^2]$ incorrectly classifies $y$ as superior to $x$ is equivalent to the event
$\mathrm{OM}[\sigma^2](x) < \mathrm{OM}[\sigma^2](y)$.

$$\Pr(\mathrm{OM}[\sigma^2](x) < \mathrm{OM}[\sigma^2](y)) = \Pr(\ell + (Z_1 - Z_2) < 0),$$

where $Z_1, Z_2 \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed. Letting $Z^* := Z_1 - Z_2$, we have $Z^* \sim
\mathcal{N}(0, 2\sigma^2)$ and $\Phi(\ell) = \Pr(Z^* < -\ell)$. Furthermore, $\Phi(\ell + 1) = \Pr(Z^* < -(\ell + 1)) < \Pr(Z^* < -\ell) = \Phi(\ell)$.
Finally, Pr (Z* < —£) < where we have applied Proposition [2j The claim 

follows from the bound 1 — a: > □ 
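Since $Z^*$ has variance $2\sigma^2$, Proposition 2 gives the closed form $\Phi(\ell) = \frac{1}{2}\mathrm{erfc}(\ell/(2\sigma))$ for $\ell > 0$. The sketch below evaluates it for a range of distances; the function name is hypothetical and $\sigma^2 > 0$ is assumed.

```python
import math

def misclassification_probability(ell, sigma2):
    """Phi(ell): chance that the noisy comparison prefers the string with ell fewer ones."""
    if ell == 0:
        return 0.5
    # Z* = Z1 - Z2 ~ N(0, 2 sigma2), so Pr(Z* < -ell) = erfc(ell / (2 sigma)) / 2.
    return 0.5 * math.erfc(ell / (2.0 * math.sqrt(sigma2)))

# With sigma2 = 25 the values decrease strictly from 1/2 as the distance grows.
phis = [misclassification_probability(ell, 25.0) for ell in range(0, 11)]
```

The strict decrease in `phis` is the monotonicity statement of Lemma 4 in numerical form.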


A sequence of events $\{\mathcal{E}_n\}$ is said to hold with high probability (w.h.p.) if $\lim_{n \to \infty} \Pr(\mathcal{E}_n) = 1$.


2.1 Algorithms 

Algorithms that operate in the presence of noise often depend on a priori knowledge of the noise intensity
(measured by the variance). In such cases, the following scheme can always be used to transform such
an algorithm into one that has no knowledge of the noise character. Suppose $A(\sigma^2)$ is an algorithm that solves
a noisy function with variance at most $\sigma^2$ within $T_\delta(\sigma^2)$ steps with probability at least $1 - \delta$. A noise-oblivious
scheme for $A$ is as follows.


Algorithm 1: Noise-oblivious scheme for $A$

    $i \leftarrow 0$;
    repeat until solution found
        Run $A(2^i)$ for $T_\delta(2^i)$ steps;
        $i \leftarrow i + 1$;
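The doubling scheme can be sketched generically as follows. The callback interface (`run_with_guess`, `budget`) is an assumption made for this example and is not taken from the paper; the toy algorithm at the end merely stands in for $A$.

```python
def noise_oblivious(run_with_guess, budget):
    """Restart with doubled variance guesses 2^0, 2^1, ... until a run succeeds.

    run_with_guess(guess, steps) is a hypothetical callback that runs the
    underlying algorithm A(guess) for `steps` steps and returns a solution,
    or None if it failed. budget(guess) plays the role of T_delta(guess).
    """
    i = 0
    total_steps = 0
    while True:
        guess = 2 ** i
        steps = budget(guess)
        total_steps += steps
        solution = run_with_guess(guess, steps)
        if solution is not None:
            return solution, guess, total_steps
        i += 1

# Toy stand-in for A: succeeds only once the guess reaches the hidden variance.
TRUE_VARIANCE = 13
toy_run = lambda guess, steps: "optimum" if guess >= TRUE_VARIANCE else None
solution, final_guess, total_steps = noise_oblivious(toy_run, budget=lambda g: 10 * g)
```

In this toy run the guesses are 1, 2, 4, 8, 16, and the summed budget of all phases stays below the budget of a single phase with the next guess, which is the kind of geometric-sum argument used in the claim that follows.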


Claim. Suppose $f_{n,v} \in F[D]$ is a function with unknown variance $v$. Fixing $n$, assume $T_\delta$ grows at least
linearly, but uniformly so. Then for any $s \in \mathbb{Z}^+$, the noise-oblivious scheme optimizes $f_{n,v}$ in at most $T_\delta(2^s v)$
steps with probability at least $1 - \delta^s$.
Proof. By the assumptions on $T_\delta$, for all $c, x$, $c\,T_\delta(x) \le T_\delta(cx)$ and so by induction, for any $k \in \mathbb{N}$,
$\sum_{i=0}^{k} T_\delta(2^i) \le T_\delta(2^{k+1})$. Let phase $i$ be the $i$-th time in the loop of Algorithm 1. We pessimistically
suppose that the noise-oblivious scheme has not found a solution by phase $\log v - 1$. Then for the next
$s$ phases, the proposed variance is at least $2^{\log v} = v$ and the probability that one of these phases is successful
is at least $1 - \delta^s$. The total number of steps is at most $\sum_{i=0}^{\log v + s - 1} T_\delta(2^i) \le T_\delta(2^s v)$. □

The $(\mu+1)$-EA, defined in Algorithm 2, is a simple mutation-only evolutionary algorithm that maintains
a population of $\mu$ solutions and uses elitist survival selection.


Algorithm 2: The $(\mu+1)$-EA

    $t \leftarrow 1$;
    $P_t \leftarrow \mu$ elements of $\{0,1\}^n$ u.a.r.;
    while termination criterion not met do
        Select $x \in P_t$ u.a.r.;
        Create $y$ by flipping each bit of $x$ w/ probability $1/n$;
        $P_{t+1} \leftarrow P_t \cup \{y\} \setminus \{z\}$ where $f(z) \le f(v)\ \forall v \in P_t$;
        $t \leftarrow t + 1$;
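A minimal runnable sketch of this algorithm on noisy OneMax is given below. It rests on two assumptions that the pseudocode leaves open: all $\mu + 1$ candidates are re-evaluated with fresh noise in every iteration, and the search stops as soon as the all-ones string is created. The function name and the `max_iters` cap are illustrative.

```python
import random

def mu_plus_one_ea(n, mu, sigma2, max_iters, seed=0):
    """Mutation-only (mu+1)-EA on noisy OneMax; returns the iteration in which
    the all-ones string is first created, or None if max_iters is exhausted."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    noisy_f = lambda x: sum(x) + rng.gauss(0.0, sigma)   # re-evaluated on every call
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for t in range(1, max_iters + 1):
        parent = rng.choice(population)
        # Standard bit mutation: flip each bit independently with probability 1/n.
        child = [1 - b if rng.random() < 1.0 / n else b for b in parent]
        if sum(child) == n:
            return t
        # Assumption: survival selection re-evaluates all mu+1 candidates
        # and discards one with the lowest noisy fitness.
        candidates = population + [child]
        scores = [noisy_f(x) for x in candidates]
        candidates.pop(scores.index(min(scores)))
        population = candidates
    return None

hit_without_noise = mu_plus_one_ea(n=10, mu=2, sigma2=0.0, max_iters=200000)
```

Rerunning the same function with a large `sigma2` gives an informal feel for how noisy survival selection degrades progress.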


The compact genetic algorithm (cGA) [14] is a genetic algorithm that maintains a population of size $K$
implicitly in memory. Rather than storing each individual separately, the cGA only keeps track of population
allele frequencies and updates these frequencies during evolution. Offspring are generated according to
these allele frequencies, which is similar to what occurs in a sexually recombining population. Indeed, the
offspring generation procedure can be viewed as so-called gene pool recombination introduced by Mühlenbein
and Paaß [21], in which all $K$ members participate in recombination. Since the cGA evolves a probability
distribution, it is also a type of estimation of distribution algorithm (EDA). The correspondence between
EDAs and models of sexually recombining populations has already been noted [20], and Harik et al. [14]
demonstrate empirically that the behavior of the cGA is equivalent to a simple genetic algorithm at least on
simple problems.

The first rigorous analysis of the cGA is due to Droste [9], who gave a general runtime lower bound for
all pseudo-Boolean functions, and a general upper bound for all linear pseudo-Boolean functions. Defined
in Algorithm 3, the cGA maintains for all times $t \in \mathbb{N}$ a frequency vector $(p_{1,t}, p_{2,t}, \ldots, p_{n,t}) \in [0,1]^n$. In
the $t$-th iteration, two strings $x$ and $y$ are sampled independently from this distribution where $\Pr(x = z) =
\Pr(y = z) = \left(\prod_{i\colon z_i = 1} p_{i,t}\right) \times \left(\prod_{i\colon z_i = 0} (1 - p_{i,t})\right)$ for all $z \in \{0,1\}^n$. The cGA then compares the objective
values of $x$ and $y$, and updates the distribution by advancing $p_{i,t}$ toward the component of the winning string
by an additive term.


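Algorithm 3: The compact GA

 1  $t \leftarrow 1$;
 2  $p_{1,t} \leftarrow p_{2,t} \leftarrow \cdots \leftarrow p_{n,t} \leftarrow 1/2$;
 3  while termination criterion not met do
 4      for $i \in \{1, \ldots, n\}$ do
 5          $x_i \leftarrow 1$ w/ prob. $p_{i,t}$, $x_i \leftarrow 0$ w/ prob. $1 - p_{i,t}$;
 6      for $i \in \{1, \ldots, n\}$ do
 7          $y_i \leftarrow 1$ w/ prob. $p_{i,t}$, $y_i \leftarrow 0$ w/ prob. $1 - p_{i,t}$;
 8      if $f(x) < f(y)$ then swap $x$ and $y$; for $i \in \{1, \ldots, n\}$ do
 9          if $x_i > y_i$ then $p_{i,t+1} \leftarrow p_{i,t} + 1/K$;
10          if $x_i < y_i$ then $p_{i,t+1} \leftarrow p_{i,t} - 1/K$;
11          if $x_i = y_i$ then $p_{i,t+1} \leftarrow p_{i,t}$;
12      $t \leftarrow t + 1$;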
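The following is a minimal sketch of the cGA on noisy OneMax, written for illustration only. It assumes an even $K$, stores each frequency as an integer count out of $K$ to avoid floating-point drift, and stops when the all-ones string is sampled; the function name and the `max_iters` cap are not from the paper.

```python
import random

def compact_ga(n, K, sigma2, max_iters, seed=0):
    """cGA on noisy OneMax; returns the iteration in which the all-ones
    string is first sampled, or None if max_iters is exhausted."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    noisy_f = lambda x: sum(x) + rng.gauss(0.0, sigma)
    # Frequencies are stored as integer counts so that p_i = counts[i] / K stays exact.
    counts = [K // 2] * n
    for t in range(1, max_iters + 1):
        x = [1 if rng.random() < c / K else 0 for c in counts]
        y = [1 if rng.random() < c / K else 0 for c in counts]
        if sum(x) == n or sum(y) == n:
            return t
        if noisy_f(x) < noisy_f(y):
            x, y = y, x                      # x is now the (noisy) winner
        for i in range(n):
            if x[i] > y[i]:
                counts[i] = min(K, counts[i] + 1)   # p_i <- p_i + 1/K (min is a safeguard)
            elif x[i] < y[i]:
                counts[i] = max(0, counts[i] - 1)   # p_i <- p_i - 1/K (max is a safeguard)
    return None

hit = compact_ga(n=20, K=200, sigma2=1.0, max_iters=500000)
```

A frequency that reaches 0 can never recover, since bit $i$ is then sampled as 0 in both strings forever; this is the failure mode that the choice of $K$ in Section 3.2 is designed to rule out.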
3 Results


We derive rigorous bounds on the optimization time, defined as the first hitting time of the process to the
true optimal solution ($1^n$) of $\mathrm{OM}[\sigma^2]$, for a mutation-only approach and the compact genetic algorithm.


3.1 Mutation-based Approach 

In this section we consider the $(\mu+1)$-EA. We will first, in Theorem 5, give a sufficient condition for when a
noise model is intractable for optimization by a $(\mu+1)$-EA. Then we will show that, in the case of additive
posterior noise sampled from a Gaussian distribution, this condition is fulfilled if the noise is large enough,
showing that the $(\mu+1)$-EA cannot deal with arbitrary Gaussian noise (see Corollary 6).


Theorem 5. Let $\mu \ge 1$ and $D$ a distribution on $\mathbb{R}$. Let $Y$ be the random variable describing the minimum
over $\mu$ independent copies of $D$. Suppose

$$\Pr(Y > D + n) \ge \frac{1}{2(\mu + 1)}.$$

Consider optimization of OneMax with reevaluated additive posterior noise from $D$ by the $(\mu+1)$-EA without
crossover. Then, for $\mu$ bounded from above by a polynomial, the optimum will not be evaluated after
polynomially many iterations w.h.p.
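To get a feeling for the premise of the theorem, the probability $\Pr(Y > D + n)$ can be estimated by simulation for Gaussian noise. The sketch below is only an illustration; the function name, the trial count, and the two variance values are arbitrary choices for this example.

```python
import random

def estimate_condition(mu, n, sigma2, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(Y > D + n), where D ~ N(0, sigma2) and
    Y is the minimum of mu further independent copies of D."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    hits = 0
    for _ in range(trials):
        d = rng.gauss(0.0, sigma)
        y = min(rng.gauss(0.0, sigma) for _ in range(mu))
        hits += y > d + n
    return hits / trials

mu, n = 3, 10
threshold = 1.0 / (2 * (mu + 1))
p_large_noise = estimate_condition(mu, n, sigma2=1e6)   # sigma = 1000, far above n
p_small_noise = estimate_condition(mu, n, sigma2=0.01)  # sigma = 0.1, far below n
```

When the standard deviation dwarfs $n$, the shift by $n$ hardly matters and the estimate approaches $1/(\mu+1)$, which exceeds the threshold; for tiny noise the event essentially never occurs.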

Proof. For all $t$ and all $i \le n$ let $X_i^t$ be the random variable describing the proportion of individuals in the
population of iteration $t$ with exactly $i$ 1s. Let $c = 800$, $b = 20$, $a = (c-1)/c$ and $a' = (c-2)/c$. We show
by induction on $t$ that

$$\forall t, \forall i \ge an\colon\ E[X_i^t] \le b^{an - i}.$$

In other words, the expected number of individuals with $i$ 1s is decaying exponentially with $i$ after $an$. This
will give the desired result with a simple union bound over polynomially many time steps.

The claim holds at the start of the algorithm with an application of Hoeffding's Inequality for the number
of 1s in a random individual. Fix some value $t$ and suppose the claim holds for that $t$. Let some value $i \ge an$
be given and let $x = b^{an - i}$. We will now show $E[X_i^{t+1}] \le x$ by considering one generation of the $(\mu+1)$-EA
without crossover.

We distinguish four cases depending on whether an individual with less than $a'n$ 1s has been selected for
reproduction, with $i - k$ 1s for some $k$ with $1 \le k \le n/c$, with exactly $i$ 1s or with strictly more than $i$ 1s.
For each of these cases we estimate the number of individuals that can be chosen to reproduce, as well as
the probability for such an individual to produce an offspring with exactly $i$ 1s. The following table gives
upper bounds for both values in all four cases; we will justify all these values below.



Number of 1s in parent    Proportion     Probability
$< a'n$                   $1$            $2^{-\Omega(n \ln n)}$
$= i - k$                 $x b^k$        $(2/c)^k$
$= i$                     $x$            $1/e + 1/(c-1)$
$> i$                     $x/(b-1)$      $1$


Clearly the proportion of individuals with $< a'n$ 1s is bounded from above by 1; for such an individual
with $m$ 0s, at least half of these 0s need to flip, which has a probability of at most $2^m/n^{m/2} = 2^{-\Omega(n \ln n)}$
using $m \ge n/c$. For any $k \le n/c$, we get a bound of $x b^k$ for the number of individuals with exactly $i - k$ 1s
from the induction hypothesis; as these individuals have at most $2n/c$ many 0s, the probability of flipping
at least $k$ of these to 1 is $\le (2/c)^k$. For an individual with exactly $i$ 1s to create an offspring with exactly $i$
1s, we can either not flip any bit (with a probability tending to $1/e$) or we flip as many 1s as 0s; flipping $k$
1s and $k$ 0s has a probability of at most $c^{-k}$ (as $i \ge an$), thus we can bound the probability of creating an offspring
with exactly $i$ 1s by

$$1/e + \sum_{k=1}^{\infty} c^{-k} = 1/e + 1/(c-1).$$
With a similar geometric sum we get that the proportion of individuals with $> i$ 1s is, using the induction
hypothesis, at most $x/(b-1)$.

From the table we can now deduce that the probability of producing an offspring with exactly $i$ 1s in
iteration $t$ is at most

$$2^{-\Omega(n \ln n)} + x\left(\sum_{k=1}^{n/c}\left(\frac{2b}{c}\right)^{k} + \frac{1}{e} + \frac{1}{c-1} + \frac{1}{b-1}\right).$$

Using $x \ge b^{-n}$ we see that the term $2^{-\Omega(n \ln n)}$ has asymptotically no impact on the sum. Furthermore, from our
choice of $b$ and $c$, we have

$$\frac{1}{e} + \frac{1}{c-1} + \frac{1}{b-1} + \frac{1}{c/(2b) - 1} < 1/2.$$

Thus, we have that we get less than $x/2$ individuals with exactly $i$ 1s in expectation, while the premise of
the theorem gives that any individual has a probability of at least 1/2 to die in any given iteration. This
shows that $E[X_i^{t+1}]$ cannot go above $x$. □
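The constant-level inequality used in the proof can be checked directly for $c = 800$ and $b = 20$. The dictionary labels below are informal descriptions added for this example.

```python
import math

c, b = 800, 20
terms = {
    "no bit flips": 1 / math.e,
    "equal numbers of 0s and 1s flipped": 1 / (c - 1),
    "parents with more than i ones": 1 / (b - 1),
    "parents with i - k ones, summed over k": 1 / (c / (2 * b) - 1),
}
total = sum(terms.values())   # roughly 0.474, below the required 1/2
```

The last term is the value of the geometric series $\sum_{k \ge 1} (2b/c)^k$, which is where the ratio $c/(2b) = 20$ enters.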


We apply Theorem 5 to show that large noise levels make it impossible for the $(\mu+1)$-EA to efficiently
optimize.


Corollary 6. Consider optimization of $\mathrm{OM}[\sigma^2]$ by the $(\mu+1)$-EA without crossover. Suppose $\sigma^2$ is sufficiently
large and $\mu$ is bounded from above by a polynomial in $n$. Then the optimum will not be evaluated after
polynomially many iterations w.h.p.


Proof. We set up to use Theorem 5. Let $D \sim \mathcal{N}(0, \sigma^2)$ and let $Y$ be the minimum over $\mu$ independent
copies of $D$. We want to bound $\Pr(Y > D + n)$. To that end we let $t_0 < 0$ and $t_1 > t_0$ be such that
$\Pr(D < t_0) = 0.75/\mu$ and $\Pr(D < t_1) = 1.5/\mu$. Let $A$ be the event that $Y > t_1$ and $t_0 - n < D < t_1 - n$ and
let $B$ be the event that $Y > t_0$ and $D < t_0 - n$. Clearly, the events $A$ and $B$ are disjoint and are contained
in the event that $Y > D + n$. From the asymptotic bounds stated in Proposition 2 and the lower bound
on $\sigma^2$ we see that subtracting $n$ changes $t_0$ and $t_1$ only by lower-order multiplicative factors. This gives that
$\Pr(D < t_0 - n)$ and $\Pr(t_0 - n < D < t_1 - n)$ are both asymptotically $0.75/\mu$, as they would be without the
"$-n$" terms; this uses the bound on $\mu$. Thus, we have asymptotically

$$\Pr(Y > D + n) \ge \Pr(A) + \Pr(B) \ge \frac{0.75}{\mu}\left[\left(1 - \frac{1.5}{\mu}\right)^{\mu} + \left(1 - \frac{0.75}{\mu}\right)^{\mu}\right] \ge \frac{1}{2(\mu + 1)}.$$

The last step uses the bound $1 - x \ge \exp(-x/(1 - x))$. □


3.2 Compact GA 

Let $T^*$ be the optimization time of the cGA on $\mathrm{OM}[\sigma^2]$, namely, the first time that it generates the underlying
"true" optimal solution $1^n$. We consider the stochastic process $X_t = n - \sum_{i=1}^{n} p_{i,t}$ and bound the optimization
time by $T = \inf\{t \ge 0\colon X_t = 0\}$. Clearly $T^* \le T$ since the cGA produces $1^n$ in the $T$-th iteration almost
surely. However, $T^*$ and $T$ can be infinite when there is a $t < T^*$ where $p_{i,t} = 0$ since the process can never
subsequently generate any string $x$ with $x_i = 1$. To circumvent this, Droste [9] estimates $E(T^*)$ conditioned
on the event that $T^*$ is finite, and then bounds the probability of finite $T^*$. In this paper, we will prove that
as long as $K$ is large enough, the optimization time is finite (indeed, polynomial) with high probability.

The following lemma is due to von Bahr and Esseen [32] and states an exact equality for the first absolute
moment of a random variable $Z$ in terms of its characteristic function $\varphi_Z(t) = E(e^{itZ})$.

Lemma 7 (special case of Lemma 2 of [32]). Let $Z$ be a random variable with $E(|Z|) < \infty$. Then

$$E(|Z|) = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{1 - \Re(\varphi_Z(t))}{t^2}\,dt,$$

where $\Re(z)$ is the real part of $z \in \mathbb{C}$.


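Lemma 8. Let $0 < a < 1/2$ be a constant. Consider a random variable $Z = Z_1 + Z_2 + \cdots + Z_n$, each $Z_i$
independent,

$$Z_i = \begin{cases} -1 & \text{with probability } p_i(1 - p_i), \\ 1 & \text{with probability } p_i(1 - p_i), \\ 0 & \text{with probability } 1 - 2p_i(1 - p_i); \end{cases}$$

with $a \le p_i \le 1$ for every $i \in \{1, \ldots, n\}$. Then $\Pr(Z = 0) \ge 1/(4\sqrt{n})$ and

$$E(|Z|) \ge a\sqrt{\frac{2}{n}}\left(n - \sum_{i=1}^{n} p_i\right).$$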

Proof. Let $\zeta = |Z_1| + |Z_2| + \cdots + |Z_n|$. Then $\zeta$ is distributed as a Poisson-Binomial distribution with each
success probability equal to $2p_i(1 - p_i)$ and

$$\Pr(Z = 0) = \sum_{k=0}^{n} \Pr(\zeta = k)\binom{k}{k/2} 2^{-k},$$

where $\binom{k}{k/2} = 0$ if $k \equiv 1 \pmod 2$. This is the joint probability that exactly $k$ of the $Z_i$ variables are nonzero,
and exactly half of these are selected to be negative, the other half positive. Since $\binom{k}{k/2}$ vanishes at odd $k$,
we can write

$$\Pr(Z = 0) = \sum_{k=0}^{\lfloor n/2 \rfloor} \Pr(\zeta = 2k)\binom{2k}{k} 2^{-2k}.$$

$\binom{2k}{k}$ is the $k$-th central binomial coefficient, for which we have the well-known bound $\binom{2k}{k} \ge 4^k/(2\sqrt{k})$, so we can
write

$$\Pr(Z = 0) \ge \Pr(\zeta = 0) + \sum_{k=1}^{\lfloor n/2 \rfloor} \Pr(\zeta = 2k)\frac{1}{2\sqrt{k}} \ge \frac{1}{2\sqrt{n}}\Pr(\zeta \text{ is even}), \qquad (1)$$

since $1/(2\sqrt{n}) \le 1$. To finish the proof, note that for any integer random variable $X$, $\Pr(X \text{ is even}) =
(1 + G(-1))/2$, where $G(z) = E(z^X)$ is the probability generating function for $X$. For a Poisson-Binomial
distribution with success probabilities $q_1, q_2, \ldots, q_n$, $G(z) = \prod_{i=1}^{n}(1 - q_i + q_i z)$; so,

$$\Pr(\zeta \text{ is even}) = \frac{1}{2}\left(1 + \prod_{i=1}^{n}(1 - 2q_i)\right).$$

Finally, since $q_i = 2p_i(1 - p_i) \le 1/2$ for all $i \in \{1, \ldots, n\}$, $\Pr(\zeta \text{ is even}) \ge 1/2$ and the claimed bound on
$\Pr(Z = 0)$ follows from (1).

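We now bound the first absolute moment of $Z$ from below. For every $S \subseteq [n]$, denote as $\mathcal{E}_S$ the event
that $|Z_i| = 1 \iff i \in S$. We first calculate the expectation of $|Z|$ conditioned on these events. Since the
$Z_i$ are independent,

$$E(e^{itZ} \mid \mathcal{E}_S) = \prod_{j=1}^{n} E(e^{itZ_j} \mid \mathcal{E}_S) = \prod_{j=1}^{n}\left([j \in S]\left(\frac{e^{it}}{2} + \frac{e^{-it}}{2}\right) + [j \notin S]\right) = \prod_{j \in S} \cos t = (\cos t)^{|S|},$$

where $[P]$ is the Iverson bracket. So by Lemma 7,

$$E(|Z| \mid \mathcal{E}_S) = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{1 - \cos^{|S|} t}{t^2}\,dt = g(|S|),$$

where $g(k) = 2\lceil k/2 \rceil \binom{2\lceil k/2 \rceil}{\lceil k/2 \rceil} 4^{-\lceil k/2 \rceil}$. Again applying bounds on the central binomial coefficient, $g(k) \ge
\sqrt{\lceil k/2 \rceil} \ge \sqrt{k/2}$. By the law of total expectation,

$$E(|Z|) = \sum_{k=1}^{n} g(k) \sum_{S \subseteq [n]\colon |S| = k} \Pr(\mathcal{E}_S) \ge \frac{1}{\sqrt{2n}} \sum_{k=1}^{n} k \sum_{S \subseteq [n]\colon |S| = k} \Pr(\mathcal{E}_S) = E(\zeta)/\sqrt{2n}. \qquad (2)$$

Since $\zeta$ follows a Poisson-Binomial distribution with the $i$-th success probability equal to $2p_i(1 - p_i)$, and
every $p_i \ge a$,

$$E(\zeta) = 2\sum_{i=1}^{n} p_i(1 - p_i) \ge 2a\left(n - \sum_{i=1}^{n} p_i\right).$$

Substituting this inequality into (2) completes the proof. □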
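Both bounds of Lemma 8 can be checked exactly on a small instance by building the distribution of $Z$ through repeated convolution. The sketch below is an illustration under assumed inputs: the helper name and the particular frequency vector (30 values spread over $[0.3, 0.9]$ with $a = 0.3$) are choices made for this example.

```python
import math

def distribution_of_Z(p):
    """Exact distribution of Z = Z_1 + ... + Z_n, where Z_i = +1 or -1 with
    probability p_i (1 - p_i) each and Z_i = 0 otherwise."""
    dist = {0: 1.0}
    for p_i in p:
        q = p_i * (1.0 - p_i)
        nxt = {}
        for value, prob in dist.items():
            for step, weight in ((-1, q), (0, 1.0 - 2.0 * q), (1, q)):
                nxt[value + step] = nxt.get(value + step, 0.0) + prob * weight
        dist = nxt
    return dist

a = 0.3
p = [0.3 + 0.6 * i / 29 for i in range(30)]   # frequencies spread over [0.3, 0.9]
n = len(p)
dist = distribution_of_Z(p)
prob_zero = dist[0]
first_abs_moment = sum(abs(v) * pr for v, pr in dist.items())
bound_zero = 1.0 / (4.0 * math.sqrt(n))
bound_moment = a * math.sqrt(2.0 / n) * (n - sum(p))
```

On this instance both `prob_zero >= bound_zero` and `first_abs_moment >= bound_moment` hold with room to spare, which suggests the lemma's constants are not tight.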


Lemma 9. Consider the cGA optimizing $\mathrm{OM}[\sigma^2]$ and let $X_t$ be the stochastic process defined above. Assume
that there exists a constant $a > 0$ such that $p_{i,t} \ge a$ for all $i \in \{1, \ldots, n\}$ and that $X_t > 0$, then
$E(X_t - X_{t+1} \mid X_t) \ge \delta X_t$ where $1/\delta = O(\sigma^2 K \sqrt{n})$.

Proof. Let $x$ and $y$ be the offspring generated in iteration $t$ and $Z_t = \|x\|_1 - \|y\|_1$. Then $Z_t = Z_{1,t} + \cdots + Z_{n,t}$
where

$$Z_{i,t} = \begin{cases} -1 & \text{if } x_i = 0 \text{ and } y_i = 1, \\ 0 & \text{if } x_i = y_i, \\ 1 & \text{if } x_i = 1 \text{ and } y_i = 0. \end{cases}$$

Let $\mathcal{E}$ denote the event that in line 8 the evaluation of $\mathrm{OM}[\sigma^2]$ correctly ranks $x$ and $y$. Without loss of
generality, suppose $\|x\|_1 \ge \|y\|_1$. Then $E(X_t - X_{t+1} \mid X_t, \mathcal{E}) = E(|Z_t|)/K$. On the other hand, if $\mathrm{OM}[\sigma^2](x)$
evaluates to at most $\mathrm{OM}[\sigma^2](y)$ during iteration $t$, the roles above are swapped and $E(X_t - X_{t+1} \mid X_t, \overline{\mathcal{E}}) =
-E(|Z_t|)/K$. By the law of total expectation,

$$E(X_t - X_{t+1} \mid X_t) = \big(1 - 2\Pr(\overline{\mathcal{E}})\big)\,\frac{E(|Z_t|)}{K}. \qquad (3)$$

For any $i \in [n]$, $\Pr(Z_{i,t} = 1) = \Pr(Z_{i,t} = -1) = p_{i,t}(1 - p_{i,t})$ and $\Pr(Z_{i,t} = 0)$ is the complementary
probability $1 - 2p_{i,t}(1 - p_{i,t})$. Since we have assumed each $p_{i,t} \ge a$, we can apply Lemma 8 to obtain

$$E(|Z_t|) \ge a\sqrt{\frac{2}{n}}\left(n - \sum_{i=1}^{n} p_{i,t}\right) = aX_t\sqrt{\frac{2}{n}}. \qquad (4)$$

To complete the proof, we substitute the inequality in Equation (4) into Equation (3) and use Lemma 4 to
bound $\Pr(\overline{\mathcal{E}}) = \Phi(|\|x\|_1 - \|y\|_1|)$ from above. □






Lemma 10. Consider the cGA optimizing $\mathrm{OM}[\sigma^2]$ with $\sigma^2 > 0$. Let $0 < a < 1/2$ be an arbitrary constant and
$T' = \min\{t \ge 0\colon \exists i \in [n], p_{i,t} \le a\}$. If $K = \omega(\sigma^2\sqrt{n}\log n)$, then for every polynomial $\mathrm{poly}(n)$, $n$ sufficiently
large, $\Pr(T' < \mathrm{poly}(n))$ is superpolynomially small.

Proof. Let $i \in [n]$ be arbitrary. Let $\{Y_t : t \ge 0\}$ be the stochastic process $Y_t = (1/2 - p_{i,t})K$. We first argue
that

$$E(Y_t \mid Y_1, \ldots, Y_{t-1}) \le Y_{t-1} - \frac{\Omega(\sigma^{-2})\Pr(x_i \ne y_i)}{\sqrt{n}}. \qquad (*)$$

Let $x$ and $y$ be the strings generated in iteration $t$ of the cGA (lines 5 and 7 of Algorithm 3). We define
$\tilde{x} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ to be the substring of $x$ constructed by removing the $i$-th element and $\tilde{y}$
similarly. Since each element of $x$ and $y$ is constructed independently, we can regard $\tilde{x}$, $\tilde{y}$, $x_i$, and $y_i$ to be
independent.

Note that $Y_t = Y_{t-1} + \delta_t$ where $\delta_t \in \{-1, 0, 1\}$. Define $\ell = \|\tilde{x}\|_1 - \|\tilde{y}\|_1$. We distinguish
between the two events that $|\ell|$ is nonzero or zero.

Case $|\ell| > 0$. Suppose without loss of generality that $\ell > 0$ (i.e., $\|\tilde{x}\|_1 > \|\tilde{y}\|_1$). So, $\delta_t = 0$ if and only if
$x_i = y_i$. Moreover, $\delta_t = -1$ only in the event that (a) $x_i = 1$ and $y_i = 0$ and $x$ is accepted (in which case
$\|x\|_1 - \|y\|_1 = \ell + 1$), or (b) $x_i = 0$ and $y_i = 1$ and $x$ is not accepted (in which case $\|x\|_1 - \|y\|_1 = \ell - 1$). Event (a) occurs only if
$\mathrm{OM}[\sigma^2]$ does not misclassify $x$ and $y$, whereas event (b) occurs only if $\mathrm{OM}[\sigma^2]$ does misclassify $x$ and $y$. Thus,

$$\Pr(\delta_t = -1) = \Pr(x_i = 1, y_i = 0)\,(1 - \Phi(\ell + 1)) + \Pr(x_i = 0, y_i = 1)\,\Phi(\ell - 1).$$

Similarly, $\delta_t = 1$ only in the event that (a) $x_i = 1$ and $y_i = 0$ but $x$ is not accepted because $x$ and $y$
were misclassified by $\mathrm{OM}[\sigma^2]$, or (b) $x_i = 0$ and $y_i = 1$ and $x$ is accepted because $\mathrm{OM}[\sigma^2]$ ranked $x$ and
$y$ correctly. Thus, $\Pr(\delta_t = 1) = \Pr(x_i = 1, y_i = 0)\,\Phi(\ell + 1) + \Pr(x_i = 0, y_i = 1)\,(1 - \Phi(\ell - 1))$. Since
$\Pr(x_i = 1, y_i = 0) = \Pr(x_i = 0, y_i = 1) = \Pr(x_i \ne y_i)/2$,

$$E(\delta_t) = \Pr(\delta_t = 1) - \Pr(\delta_t = -1) = -\Pr(x_i \ne y_i)\,(\Phi(\ell - 1) - \Phi(\ell + 1)) \le 0,$$

where we apply Lemma 4. We conclude that in this case,

$$E(Y_t \mid \ell \ne 0, Y_1, \ldots, Y_{t-1}) = Y_{t-1} + E(\delta_t) \le Y_{t-1}.$$

Case $\ell = 0$. In this case, if $x_i = y_i$, then $\|x\|_1 = \|y\|_1$ and there is zero drift. Otherwise, $x_i > y_i$ and so $\|x\|_1 - \|y\|_1 =
1$, or $y_i > x_i$ and $\|y\|_1 - \|x\|_1 = 1$. The drift in this case only depends on whether or not $\mathrm{OM}[\sigma^2]$ misclassifies
$x$ and $y$. In particular, $\Pr(\delta_t = -1) = \Pr(x_i = 1, y_i = 0)(1 - \Phi(1)) + \Pr(x_i = 0, y_i = 1)(1 - \Phi(1))$, and
$\Pr(\delta_t = 1) = \Pr(x_i = 1, y_i = 0)\Phi(1) + \Pr(x_i = 0, y_i = 1)\Phi(1)$. By Lemma 4,

$$E(\delta_t) = \Pr(\delta_t = 1) - \Pr(\delta_t = -1) = -\Pr(x_i \ne y_i)(1 - 2\Phi(1)) \le -\Omega(\sigma^{-2})\Pr(x_i \ne y_i).$$

For this case, $E(Y_t \mid \ell = 0, Y_1, \ldots, Y_{t-1}) = Y_{t-1} + E(\delta_t) \le Y_{t-1} - \Omega(\sigma^{-2})\Pr(x_i \ne y_i)$.

Applying the law of total expectation, $E(Y_t \mid Y_1, \ldots, Y_{t-1})$ is bounded above by

$$Y_{t-1} - \Omega(\sigma^{-2})\Pr(x_i \ne y_i)\Pr(\ell = 0).$$

It remains to bound $\Pr(\ell = 0) = \Pr(\|\tilde{x}\|_1 = \|\tilde{y}\|_1)$. We define a random variable $Z = Z_2 + \cdots + Z_n$ where

$$Z_j = \begin{cases} +1 & \text{if } x_j > y_j, \\ 0 & \text{if } x_j = y_j, \\ -1 & \text{if } x_j < y_j. \end{cases}$$

So $\Pr(\|\tilde{x}\|_1 = \|\tilde{y}\|_1) = \Pr(Z = 0) \ge 1/(4\sqrt{n - 1})$ by Lemma 8 since $0 \le \Pr(x_j > y_j) = \Pr(x_j < y_j) =
p_j(1 - p_j) \le 1/2$ for all $j \in \{2, \ldots, n\}$, proving the claim in (*).

Note that $\{Y_t : t \in \mathbb{N}\}$ is a Markov chain on $\{-K/2, -K/2 + 1, \ldots, K/2 - 1, K/2\}$ with $Y_1 = 0$. Let
$T = \min\{t\colon Y_t \ge (1/2 - a)K\}$. In any iteration, if $x_i = y_i$, then $Y_t = Y_{t-1}$. Thus, to bound $T$, we can
ignore self-loops in the chain.

More formally, let $\{\hat{Y}_t : t \in \mathbb{N}\}$ be the restriction of $Y_t$ to iterations such that $Y_t \ne Y_{t-1}$. Similarly, let
$\hat{T} = \min\{t\colon \hat{Y}_t \ge (1/2 - a)K\}$. The random variable $T$ stochastically dominates the random variable $\hat{T}$ since
removing equal moves can only make the process hit faster, i.e., $\forall t \in \mathbb{N}$, $\Pr(T \ge t) \ge \Pr(\hat{T} \ge t)$. Due to the
above arguments,

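$$E(\hat{Y}_t \mid \hat{Y}_1, \ldots, \hat{Y}_{t-1}) = E(Y_t \mid x_i \ne y_i, Y_1, \ldots, Y_{t-1}) = \hat{Y}_{t-1} + E(\delta_t \mid x_i \ne y_i) \le \hat{Y}_{t-1} - \Omega(\sigma^{-2}/\sqrt{n}).$$

By a refinement to the negative drift theorem of Oliveto and Witt [5H1 [M] (cf. Theorem 3 of [TS]), since
$\hat{Y}_1 = Y_1 = 0$ and $|\hat{Y}_t - \hat{Y}_{t+1}| = 1 < \sqrt{2}$, for all $s \ge 0$,

$$\Pr(T < s) \le \Pr(\hat{T} < s) \le s \exp\left(-\Omega\big((1/2 - a)K|\varepsilon|\big)\right)$$

with $\varepsilon = -\Omega(\sigma^{-2}/\sqrt{n})$. Since $K = \omega(\sigma^2\sqrt{n}\log n)$, $\Pr(T < s) = s\,n^{-\omega(1)}$.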

So, for any polynomial $s = \mathrm{poly}(n)$, with probability superpolynomially close to one, $Y_t$ has not yet
reached a state larger than $(1/2 - a)K$, and so $p_{i,t} > a$ for all $1 \le t \le s$. As this holds for arbitrary $i$,
applying a union bound retains a superpolynomially small probability that any of the $n$ frequencies have
gone below $a$ by $s = \mathrm{poly}(n)$ steps. □

Theorem 11. Consider the cGA optimizing $\mathrm{OM}[\sigma^2]$ with variance $\sigma^2 > 0$ for any constant $c > 0$. If
$K = \omega(\sigma^2\sqrt{n}\log n)$, then with probability $1 - o(1)$, the cGA finds the optimum after $O(K\sigma^2\sqrt{n}\log Kn)$
steps.

Proof. We will consider the drift of the stochastic process $\{X_t : t \in \mathbb{N}\}$ over the state space $S \subseteq \{0\} \cup
[x_{\min}, x_{\max}]$ where $X_t = n - \sum_{i=1}^{n} p_{i,t}$. Hence, $x_{\min} = 1/K$.

Fix a constant $0 < a < 1/2$. We say the process has failed by time $t$ if there exists some $s \le t$
and some $i \in [n]$ such that $p_{i,s} \le a$. Let $T = \min\{t \in \mathbb{N}\colon X_t = 0\}$. Assuming the process never fails,
by Lemma 9 the drift of $\{X_t : t \in \mathbb{N}\}$ in each step is bounded by $E(X_t - X_{t+1} \mid X_t = s) \ge \delta X_t$ where
$1/\delta = O(\sigma^2 K\sqrt{n})$. Hence, by tail bounds for the multiplicative drift theorem (see Doerr and Goldberg),
$\Pr(T > (\ln(X_1/x_{\min}) + r)/\delta) \le e^{-r}$. Choosing $r = d \ln n$ for any constant $d > 0$, the probability that
$T = \Omega(K\sigma^2\sqrt{n}\log Kn)$ is at most $n^{-d}$.

Letting $\mathcal{E}$ be the event that the process has not failed by $O(K\sigma^2\sqrt{n}\log Kn)$ steps, by the law of total
probability, the hitting time of $X_t = 0$ is bounded by $O(K\sigma^2\sqrt{n}\log Kn)$ with probability $(1 - n^{-d})\Pr(\mathcal{E}) =
1 - o(1)$ where we can apply Lemma 10 to bound the probability of $\mathcal{E}$. □


4 Experiments 

In Section 3 we proved that the cGA scales gracefully with noise (see Def. 1) on a simple noisy pseudo-Boolean
function, whereas a mutation-only EA fails when the noise variance is too high. In this section,
we seek to compare the performance of the cGA with a baseline hillclimber that uses explicit resampling to 
reduce the noise variance. 

Our baseline hillclimber is called resampling randomized local search (reRLS). For a particular variance
$\sigma^2$, reRLS estimates the true objective function value by performing $O(\sigma^2 \log n)$ function calls for each search
point. It then hillclimbs on the estimated true objective function by flipping a single bit in each iteration
and accepting points with equal or better estimated objectives. Both reRLS and the cGA require knowledge
of the true noise variance to collect enough samples (reRLS) or to set $K$ properly (cGA). We also investigate
the performance of these approaches in the corresponding noise-oblivious setting as defined in Section 2.1
(NO-reRLS and NO-cGA).
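The reRLS baseline described above might look roughly as follows. This is a sketch under stated assumptions rather than the implementation used in the experiments: the constant `c` in the resample count, the choice to re-estimate both parent and offspring in every iteration, and the stopping rule are all assumptions made for this example.

```python
import math
import random

def rerls(n, sigma2, max_iters, c=8.0, seed=0):
    """Resampling RLS on noisy OneMax. Returns (iterations, evaluations) when the
    all-ones string is first created, or None if max_iters is exhausted."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    # Number of resamples per search point; the constant c is an assumption.
    m = max(1, math.ceil(c * sigma2 * math.log(n)))
    evaluations = 0

    def estimate(x):
        nonlocal evaluations
        evaluations += m
        return sum(x) + sum(rng.gauss(0.0, sigma) for _ in range(m)) / m

    x = [rng.randint(0, 1) for _ in range(n)]
    if sum(x) == n:
        return 0, 0
    for t in range(1, max_iters + 1):
        y = list(x)
        y[rng.randrange(n)] ^= 1            # flip exactly one uniformly chosen bit
        if sum(y) == n:
            return t, evaluations
        if estimate(y) >= estimate(x):      # accept equal or better estimates
            x = y
    return None

result = rerls(n=10, sigma2=1.0, max_iters=100000)
```

The evaluation count grows linearly with the number of resamples per point, which is the cost that the cGA avoids by absorbing the noise into its frequency updates.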

We measure the performance of each procedure by the number of calls to the objective function until the 
true optimum 1" is generated. This performance metric is standard in the field of evolutionary computation 
because typically objective function evaluation is the most costly operation in terms of computation time. 
For the cGA, this is twice the number of iterations through the while loop in Algorithm 3. For reRLS, this
is the number of iterations times the number of resamples necessary to obtain a suitable estimate of the true
objective function value.

The performance of each algorithm is plotted fixing $n = 100$ and controlling the variance in Figure 1.
For each procedure and variance value we run each algorithm 100 times until the true optimum is found
and collect the number of calls to the objective function for each run. The median run times and their
interquartile ranges are plotted. We also plot the performance as a function of $n$ (fixing $\sigma^2 = \sqrt{n}$) in
Figure 2. Both results are plotted on log-log plots; thus the cGA variants are an order of magnitude faster
than the baseline.

Figures 3 and 4 correspond to Figures 1 and 2, respectively, and depict the number of re-evaluations
((NO-)reRLS) per iteration or the value of $K$ ((NO-)cGA) that was sufficient for the respective algorithm to
succeed. Note that the functions for the non-noise-oblivious algorithms have deterministic function values
whereas the ones for the noise-oblivious versions are random variables.


5 Conclusions 

In this paper we have examined the benefit of sexual recombination in evolutionary optimization in an 
uncertain environment. We introduce the concept of an algorithm scaling gracefully with noise. We rigorously 
proved that mutation-only evolutionary algorithms do not scale gracefully in the sense that they cannot 
optimize noisy functions in polynomial time when the noise intensity is sufficiently high. On the other hand, 
we proved that a simple estimation of distribution algorithm that uses gene pool recombination can always
optimize noisy OneMax ($\mathrm{OM}[\sigma^2]$) in polynomial time, subject only to the condition that the noise variance
is bounded by some polynomial in $n$.

A common way to handle noisy objective functions is to modify the optimization algorithm to perform 
resampling in order to estimate the true value of the underlying objective function. We have also presented 
empirical results that show the sexual recombination algorithm optimizes $\mathrm{OM}[\sigma^2]$ an order of magnitude
faster than a resampling hillclimber. Our results highlight the importance of understanding the influence 
of different search operators in uncertain environments, and suggest that algorithms such as the compact 
genetic algorithm that use sexual recombination are able to scale gracefully with noise. 


Figure 1: Median run time (number of evaluations of $f$) as a function of noise variance for $n = 100$, 100 runs at each point. Shaded area
denotes IQR.
Figure 2: Median run time as a function of $n$ for $\sigma^2 = \sqrt{n}$, 100 runs at each point. Shaded area denotes
IQR.
Figure 3: Median of number of re-evaluations per iteration or median of $K$ as a function of noise variance
for $n = 100$, 100 runs at each point. Shaded area denotes IQR.


Figure 4: Median of number of re-evaluations per iteration or median of $K$ as a function of $n$ for $\sigma^2 = \sqrt{n}$,
100 runs at each point. Shaded area denotes IQR.